Project-Team:ZENITH

Inria | Raweb 2017 | Presentation of the Project-Team ZENITH | ZENITH Web Site



Section: New Results

Data Search

Adversarial Autoencoders For Novelty Detection

Participants : Valentin Leveau, Alexis Joly.

In this work [40], we addressed the problem of novelty detection, i.e., recognizing at test time whether or not a data item comes from the training data distribution. We focused on adversarial autoencoders (AAE), which have the advantage of explicitly controlling the distribution of the known data in the feature space. We showed that, when trained in a (semi-)supervised way, they provide consistent novelty detection improvements over a classical autoencoder. We further improved their performance by introducing an explicit rejection class in the prior distribution, coupled with random input images fed to the autoencoder.
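To illustrate why a controlled latent distribution helps, the following minimal sketch scores a latent code by its negative log-density under a class-structured prior. It is not the implementation of [40]: the encoder is omitted, and the 2-D latent space, the equal-weight mixture of isotropic Gaussians, the class positions in `CLASS_MEANS`, and the two example codes are all assumptions made for illustration.

```python
import math

def log_gaussian(z, mean, sigma):
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I) at point z."""
    d = len(z)
    sq_dist = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2 * math.pi))

def novelty_score(z, class_means, sigma=1.0):
    """Negative log-density of z under an equal-weight mixture of class Gaussians.

    Higher scores mean the code lies further from every known-class mode.
    """
    logs = [log_gaussian(z, m, sigma) for m in class_means]
    top = max(logs)  # log-sum-exp trick for numerical stability
    log_mix = top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(len(logs))
    return -log_mix

# Hypothetical prior: one Gaussian mode per known class (assumed positions).
CLASS_MEANS = [(-4.0, 0.0), (4.0, 0.0)]

known_code = (3.8, 0.3)   # hypothetical latent code close to a class mode
novel_code = (0.0, 9.0)   # hypothetical latent code far from every class mode

known_score = novelty_score(known_code, CLASS_MEANS)
novel_score = novelty_score(novel_code, CLASS_MEANS)
print(known_score < novel_score)
```

In practice a threshold on such a score would have to be tuned on held-out data; the explicit rejection class described above is a different mechanism and is not modelled in this sketch.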

Going deeper in the automated identification of Herbarium specimens

Participants : Alexis Joly, Herve Goeau.

Hundreds of herbarium collections have accumulated a valuable heritage and knowledge of plants over several centuries. Recent initiatives have started ambitious preservation plans to digitize this information and make it available to botanists and the general public through web portals. However, thousands of sheets are still unidentified at the species level, while numerous sheets should be reviewed and updated following more recent taxonomic knowledge. These annotations and revisions require an unrealistic amount of work for botanists to carry out in a reasonable time. Computer vision and machine learning approaches applied to herbarium sheets are promising but remain little studied compared to automated species identification from leaf scans or pictures of plants in the field. In this work [14], we studied and evaluated the accuracy with which herbarium images can be exploited for species identification with deep learning technology. In addition, we studied whether combining herbarium sheets with photos of plants in the field is relevant in terms of accuracy. Finally, we explored whether herbarium images from one region with a specific flora can be used for transfer learning to another region with other species, for example a region under-represented in terms of collected data. This is, to our knowledge, the first study that uses deep learning to analyze a large dataset with thousands of species from herbaria. Results show the potential of deep learning for herbarium species identification, particularly when training and testing across different datasets from different herbaria. This could lead to the creation of a semi-automated or even fully automated system to help taxonomists and experts with their annotation, classification, and revision work.

Crowdsourcing Thousands of Specialized Labels: a Bayesian active training approach

Participants : Maximilien Servajean, Alexis Joly, Dennis Shasha, Julien Champ, Esther Pacitti.

The use of crowdsourced and, more generally, user-generated annotations has become the de facto methodology for building training data in a variety of data indexing and search tasks. When the labels correspond to well-known or easy-to-learn concepts, it is straightforward to train the annotators by giving a few examples with known answers. This no longer holds when there are thousands of complex domain-specific labels. In this work, we focused on the particular case of crowdsourcing domain-specific annotations that usually require hard expert knowledge (such as plant species names, architectural styles, medical diagnostic tags, etc.). We considered that common knowledge is not sufficient to perform the task, but that anyone can be taught to recognize a small subset of domain-specific concepts. In such a context, it is best to take advantage of the various capabilities of each annotator through teaching (annotators can enhance their knowledge), assignment (annotators can be focused on tasks they have the knowledge to complete) and inference (different annotator propositions can be aggregated to enhance labeling quality). In this work [20], we proposed a set of data-driven algorithms to (i) train image annotators on how to disambiguate among automatically generated candidate labels, (ii) evaluate the quality of annotators’ label suggestions and (iii) weight predictions. The algorithms adapt to the skills of each annotator both in the questions asked and in the weights given to their answers. The underlying judgements are Bayesian, based on adaptive priors. We measured the benefits of these algorithms through a live user experiment on image-based plant identification involving around 1,000 people (at the origin of ThePlantGame, see Software section). The proposed methods yield large gains in annotation accuracy: while a standard user could correctly label around 2% of our data, this goes up to 80% with machine learning assisted training and almost 90% when doing a weighted combination of several annotators’ labels.
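The inference step (weighting several annotators' answers by their skill) can be sketched as a simple Bayesian aggregation. This is not the algorithm of [20]: it assumes independent annotators, a uniform prior over the candidate labels, errors spread evenly over the wrong candidates, and per-annotator accuracies that are already known. The annotator names, accuracies, and species labels are hypothetical.

```python
import math

def aggregate(votes, accuracies, candidates):
    """Posterior over candidate labels given independent annotator votes.

    votes:      {annotator: label proposed by that annotator}
    accuracies: {annotator: assumed probability of a correct answer, in (0, 1)}
    candidates: list of at least two candidate labels
    """
    k = len(candidates)
    log_post = {c: 0.0 for c in candidates}  # uniform prior (constant dropped)
    for annotator, label in votes.items():
        acc = accuracies[annotator]
        for c in candidates:
            # Likelihood of this vote if c were the true label.
            p = acc if c == label else (1.0 - acc) / (k - 1)
            log_post[c] += math.log(p)
    top = max(log_post.values())  # subtract the max before exponentiating
    unnorm = {c: math.exp(v - top) for c, v in log_post.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical example: one skilled annotator against two weaker ones.
candidates = ["Quercus robur", "Quercus petraea", "Quercus pubescens"]
accuracies = {"ann_a": 0.9, "ann_b": 0.4, "ann_c": 0.4}
votes = {"ann_a": "Quercus robur", "ann_b": "Quercus petraea", "ann_c": "Quercus petraea"}

posterior = aggregate(votes, accuracies, candidates)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

With these assumed accuracies, the single skilled annotator outweighs the two-vote majority, which is the behaviour a plain majority vote cannot provide.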

Evaluation of Content-Based Biodiversity Identification techniques

Participants : Alexis Joly, Herve Goeau, Jean-Christophe Lombardo.

We ran a new edition of the LifeCLEF evaluation campaign [26] with the involvement of 15 research teams working on content-based biodiversity identification worldwide. The main novelties of the 2017 edition of LifeCLEF compared to the previous years were the following:

Scalability: To fully reach its objective, an evaluation campaign such as LifeCLEF requires a long-term research effort so as to (i) encourage non-incremental contributions, (ii) measure consistent performance gaps and (iii) progressively scale up the problem. Therefore, the number of species was increased considerably between the 2016 and 2017 editions. The plant task, in particular, made a big jump, with 10,000 species instead of 1,000 species in the training set. This makes it one of the largest image classification benchmarks. In addition, the data set of the bird task was increased by 50%, up to 1,500 species, which makes it the largest audio classification benchmark as well.
Noisy vs. clean data: The focus of the plant task this year was to study the impact of training identification systems on noisy Web data rather than clean data [35]. Collecting clean data at scale is prohibitive in terms of human cost, whereas noisy Web data can be collected very cheaply. Therefore, we built two large-scale datasets illustrating the same 10K species: one with clean labels coming from the Web platform Encyclopedia Of Life, and one with a high degree of noise (domain noise as well as category noise), crawled from the Web without any filtering. The main conclusion of our evaluation was that convolutional neural networks (CNN) appear to be remarkably effective in the presence of noise in the training set. All networks trained solely on the noisy dataset outperformed the same models trained on the trusted data. Even at a constant number of training iterations (i.e. at a constant number of images passed to the network), it was more profitable to use the noisy training data. This means that diversity in the training data is a key factor in improving the generalization ability of deep learning. The noise itself seems to act as a regularization of the model. Beyond technical aspects, this conclusion is of high importance in botany and biodiversity informatics in general. Data quality and data validation issues are of crucial importance in these fields, and our conclusion is somewhat disruptive.
Time-coded soundscapes: As the soundscape data appeared to be very challenging in 2016 (with an accuracy below 15%), we introduced in 2017 new soundscape recordings containing time-coded bird species annotations, thanks to the involvement of expert ornithologists. In total, 4.5 hours of audio recordings were collected and annotated manually, with more than 2,000 identified segments. The main outcome of our evaluation [36] was that the best performing system on that data was based on a purely image-based convolutional neural network architecture (Inception V4) applied to a standard time-frequency representation. This shows the convergence of the best performing methods whatever the targeted domain.
New organisms and identification scenarios: The SeaCLEF task was extended with novel scenarios involving new organisms, i.e., (i) salmon detection for the monitoring of water turbines and (ii) marine animal species recognition using weakly-labeled images and relevance ranking.

Pl@ntNet Business Venture proposal

Participants : Alexis Joly, Herve Goeau, Antoine Affouard, Jean-Christophe Lombardo.

The ACM Multimedia conference (rank A) introduced in 2017 a new "Business Venture Track" soliciting business venture proposals that build on multimedia technology. The aim is to bridge the gap between academia and industry in multimedia research, innovation and application. The track was open for submissions by all multimedia researchers and entrepreneurs. In this context, we worked on a business venture proposal around the Pl@ntNet project, which has been accepted for publication [25]. Our business proposal is to allow enterprises or organizations to set up their own private collaborative workflow within the Pl@ntNet information system. The main added value is to allow them to work on their own business object (e.g. plant disease diagnostic, deficiency measurements, railway line maintenance, etc.) and with their own community of contributors and end-users (employees, sales representatives, clients, observer networks, etc.). This business idea answers a growing demand in agriculture and environmental economics. Actors in these domains acknowledge that machine learning techniques are mature enough, but the lack of training data and of efficient tools to collect them remains a major problem. A collaborative platform like Pl@ntNet, extended with the technical innovations presented in this paper, is well suited to bridging this gap. It is expected to initiate a positive feedback loop, boosting the production of training data while improving the work of the employees.
